[PATCH] Caching/reusing WWW::RobotRules(::InCore)

The current behaviour of LWP::RobotUA when it is passed an existing
WWW::RobotRules::InCore object is counterintuitive to me.

I am of this opinion because of the documentation of $rules in
LWP::RobotUA->new() and WWW::RobotRules->agent(), as well as the
implementation in WWW::RobotRules::AnyDBM_File.

Currently, W::R::InCore always empties the cache when agent() is called,
regardless of whether the agent name has changed. W::R::AnyDBM_File does
not seem to have this problem.

I suggest applying the attached patch to fix this.
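
To illustrate the behaviour I am after, here is a minimal stand-alone
sketch. Demo::Rules is a hypothetical stand-in, not the real
WWW::RobotRules::InCore: only agent() is modelled, using the logic from
the attached patch, and the 'example.com:80' cache key is made up for the
demonstration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WWW::RobotRules::InCore; only agent() is modelled.
package Demo::Rules;

sub new {
    my ($class, $ua) = @_;
    my $self = bless { loc => {} }, $class;
    $self->agent($ua);
    return $self;
}

sub agent {
    my ($self, $name) = @_;
    my $old = $self->{ua};
    if ($name) {
        $name = $1 if $name =~ m/(\S+)/;   # get first word
        $name =~ s!/.*!!;                  # get rid of version
        unless ($old && $old eq $name) {
            delete $self->{loc};           # all old info is now stale
            $self->{ua} = $name;
        }
    }
    return $old;
}

package main;

my $rules = Demo::Rules->new('FooBot/1.2');
$rules->{loc}{'example.com:80'} = 'cached rules';   # pretend robots.txt was parsed

$rules->agent('FooBot/1.3');     # same short name: cache survives
my $kept = exists $rules->{loc}{'example.com:80'} ? 1 : 0;

$rules->agent('BarBot/0.1');     # different short name: cache is dropped
my $dropped = exists $rules->{loc} ? 0 : 1;

print "kept=$kept dropped=$dropped\n";
```

With the unpatched logic the first agent() call would already have
thrown the cached entry away, which is what surprised me.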

Additionally, I see that InCore and AnyDBM_File use different algorithms
for getting the "short" agent name from the full one, with the
AnyDBM_File one looking "older". Perhaps add a new method/function for
this (e.g. short_agent()) in WWW::RobotRules that could be used in both
InCore and AnyDBM_File?
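
As a rough sketch of what I mean (short_agent() does not exist in
WWW::RobotRules today; the name and the plain-function form are only a
proposal), the helper could simply lift the two lines InCore's agent()
already uses:

```perl
use strict;
use warnings;

# Hypothetical helper; short_agent() is only proposed here and is not
# part of WWW::RobotRules.
sub short_agent {
    my ($name) = @_;
    return unless defined $name;
    $name = $1 if $name =~ m/(\S+)/;   # get first word
    $name =~ s!/.*!!;                  # get rid of version
    return $name;
}

for my $full ('FooBot',
              'FooBot/1.2',
              'FooBot/1.2 [http://foobot.int; foo@bot.int]') {
    printf "%s => %s\n", $full, short_agent($full);
}
```

Both InCore and AnyDBM_File could then call the same code, so the two
implementations would agree on the short name.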

While on the robots subject, applying something like the "warning could
be more helpful" change from
http://www.xray.mpe.mpg.de/mailing-lists/libwww-perl/2004-08/msg00024.html
would be most welcome.

Index: lib/WWW/RobotRules.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
retrieving revision 1.30
diff -a -u -r1.30 RobotRules.pm
--- lib/WWW/RobotRules.pm 9 Apr 2004 15:09:14 -0000 1.30
+++ lib/WWW/RobotRules.pm 12 Oct 2004 06:39:34 -0000
@@ -185,10 +185,12 @@
         # "FooBot/1.2" => "FooBot"
         # "FooBot/1.2 [http://foobot.int; foo@bot.int]" => "FooBot"
 
-        delete $self->{'loc'}; # all old info is now stale
         $name = $1 if $name =~ m/(\S+)/; # get first word
         $name =~ s!/.*!!; # get rid of version
-        $self->{'ua'}=$name;
+        unless ($old && $old eq $name) {
+            delete $self->{'loc'}; # all old info is now stale
+            $self->{'ua'} = $name;
+        }
     }
     $old;
 }

ville.skytta [ Tue, 12 October 2004 08:54 ] [ ID #176602 ]

Re: [PATCH] Caching/reusing WWW::RobotRules(::InCore)

Ville Skyttä <ville.skytta [at] iki.fi> writes:

> The current behaviour of LWP::RobotUA when it is passed an existing
> WWW::RobotRules::InCore object is counterintuitive to me.
>
> I am of this opinion because of the documentation of $rules in
> LWP::RobotUA->new() and WWW::RobotRules->agent(), as well as the
> implementation in WWW::RobotRules::AnyDBM_File.
>
> Currently, W::R::InCore always empties the cache when agent() is called,
> regardless of whether the agent name has changed. W::R::AnyDBM_File does
> not seem to have this problem.
>
> I suggest applying the attached patch to fix this.

Applied. Will be in 5.801.

Regards,
Gisle


> Index: lib/WWW/RobotRules.pm
> ===================================================================
> RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
> retrieving revision 1.30
> diff -a -u -r1.30 RobotRules.pm
> --- lib/WWW/RobotRules.pm 9 Apr 2004 15:09:14 -0000 1.30
> +++ lib/WWW/RobotRules.pm 12 Oct 2004 06:39:34 -0000
> @@ -185,10 +185,12 @@
>          # "FooBot/1.2" => "FooBot"
>          # "FooBot/1.2 [http://foobot.int; foo@bot.int]" => "FooBot"
>
> -        delete $self->{'loc'}; # all old info is now stale
>          $name = $1 if $name =~ m/(\S+)/; # get first word
>          $name =~ s!/.*!!; # get rid of version
> -        $self->{'ua'}=$name;
> +        unless ($old && $old eq $name) {
> +            delete $self->{'loc'}; # all old info is now stale
> +            $self->{'ua'} = $name;
> +        }
>      }
>      $old;
>  }
gisle [ Fri, 12 November 2004 17:15 ] [ ID #480290 ]